We introduce bounds on the finite-time performance of Markov chain Monte Carlo algorithms in approaching the global solution of stochastic optimization problems over continuous domains. A comparison with other state-of-the-art methods having finite-time guarantees for solving stochastic programming problems is included.
I. Introduction



In principle, any optimization problem on a finite domain can be solved by an exhaustive search. However, this is often beyond computational capacity: the optimization domain of the traveling salesman problem with 100 cities contains more than $10^{155}$ possible tours [1]. An efficient algorithm to solve the traveling salesman and many similar problems has not yet been found, and such problems remain solvable only in principle [2]. Statistical mechanics has inspired widely used methods for finding good approximate solutions in hard discrete optimization problems which defy efficient exact solutions [3]-[5]. Here a key idea has been that of simulated annealing [3]: a random search based on the Metropolis-Hastings algorithm, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers. Convergence and finite-time performance of simulated annealing on finite domains have been evaluated e.g. in [6]-[10].

On continuous domains, most popular optimization methods perform a local gradient-based search and in general converge to local optimizers, with the notable exception of convex optimization problems where convergence to the unique global optimizer occurs [11]. Simulated annealing performs a global search and can be easily implemented on continuous domains using the general family of Markov chain Monte Carlo (MCMC) methods [12]. Hence it can be considered a powerful complement to local methods. In this paper, we introduce for the first time rigorous guarantees on the finite-time performance of simulated annealing on continuous domains. We will show that it is possible to derive MCMC algorithms to implement simulated annealing which can find an approximate solution to the problem of optimizing a function of continuous variables, within a specified tolerance and with an arbitrarily high level of confidence, after a known finite number of steps. Rigorous guarantees on the finite-time performance of simulated annealing in the optimization of functions of continuous variables have never been obtained before; the only results available state that simulated annealing converges to a global optimizer as the number of steps grows to infinity, e.g. [13]-[17]. Asymptotic convergence rates have been obtained in [18], [19].

The background of our work is twofold. On the one hand, our definition of "approximate domain optimizer", introduced in Section II as an approximate solution to a global optimization problem, is inspired by the definition of "probably approximate near minimum" introduced by Vidyasagar in [20] for global optimization based on the concept of finite-time learning with known accuracy and confidence of statistical learning theory [21], [22]. In the control field the work of Vidyasagar [20], [22] has been seminal in the development of the so-called randomized approach. Inspired by statistical learning theory, this approach is characterized by the construction of algorithms which make use of independent sampling in order to find probabilistic approximate solutions to difficult control system design applications, see e.g. [23]-[25] and the references therein. In our work, the definition of approximate domain optimizer will be essential in establishing rigorous guarantees on the finite-time performance of simulated annealing. On the other hand, we show that our rigorous finite-time guarantees can be achieved by the wider class of algorithms based on Markov chain Monte Carlo sampling. Hence, we ground our results on the theory of convergence, with quantitative bounds on the distance to the target distribution, of the Metropolis-Hastings algorithm and MCMC methods [26]-[29]. In addition, we demonstrate how, under some quite weak regularity conditions, our definition of approximate domain optimizer can be related to the standard notion of approximate optimization considered in the stochastic programming literature [30]-[33]. This link provides theoretical support for the use of simulated annealing and MCMC optimization algorithms, which have been proposed, for example, in [34]-[36], for solving stochastic programming problems. In this paper, beyond the presentation of some simple illustrative examples, we will not develop any ready-to-use optimization algorithm. The Metropolis-Hastings algorithm and the general family of MCMC methods have many degrees of freedom. The choice and comparison of specific algorithms goes beyond the scope of the paper.

The paper is organized as follows. In Section II we introduce the definition of approximate domain optimizer and establish a direct relationship between the approximate domain optimizer and the standard notion of approximate optimizer adopted in the stochastic programming literature. In Section III we first recall the reasons why existing results on the convergence of simulated annealing on continuous domains do not provide finite-time guarantees. Then we state the main results of the paper and we discuss their consequences. In Section IV we illustrate the convergence of MCMC algorithms. In Section V we present a simple illustrative numerical example. In Section VI we compare the MCMC approach with other state-of-the-art methods for solving stochastic programming problems with finite-time performance bounds. In Section VII we state our findings and conclude the paper. The Appendix contains all technical proofs. Some of the results of this paper were included in preliminary conference contributions [37], [38].

II. Approximate optimizers

Consider an optimization criterion $U : \Theta \to \mathbb{R}$, with $\Theta \subseteq \mathbb{R}^n$, and let

$$U^* := \sup_{\theta \in \Theta} U(\theta). \qquad (1)$$

The following will be a standing assumption for all our results.

Assumption 1: $\Theta$ has finite Lebesgue measure. $U$ is well defined point-wise, measurable, and bounded between 0 and 1 (i.e. $U(\theta) \in [0, 1]$ $\forall \theta \in \Theta$).

In general, any bounded criterion can be scaled to take values in $[0, 1]$. Given, for example, $U'(\theta) \in [\underline{U}, \overline{U}]$ we can consider the optimization of the modified function

$$U(\theta) = \frac{U'(\theta) - \underline{U}}{\overline{U} - \underline{U}}$$

which takes values in $[0, 1]$ for all $\theta \in \Theta$. (In this case, we need to multiply the value imprecision, $\epsilon$ below, by $(\overline{U} - \underline{U})$ to obtain its corresponding value in the scale of the original criterion $U'$.) For some results another assumption will be needed.

Assumption 2: $\Theta$ is compact. $U$ is Lipschitz continuous.

We use $L$ to denote the Lipschitz constant of $U$, i.e. $\forall \theta_1, \theta_2 \in \Theta$, $|U(\theta_1) - U(\theta_2)| \le L \|\theta_1 - \theta_2\|$. Assumption 2 implies the existence of a global optimizer, i.e. under Assumption 2 we have

$$\Theta^* := \{\theta \in \Theta \mid U(\theta) = U^*\} \ne \emptyset.$$
If, given an element $\theta$ in $\Theta$, the value $U(\theta)$ can be computed directly, we say that $U$ is a deterministic criterion. In this case the optimization problem (1) is a standard, in general non-linear, non-smooth, programming problem. Examples of such a deterministic optimization criterion are, among many possible others, the design criterion in a robust control design problem [20] and the energy landscape in protein structure prediction [39]. In problems involving random variables, the value $U(\theta)$ can be the expected value of some function $g : X \times \Theta \to \mathbb{R}$ which depends on both the optimization variable $\theta$ and on some random variable $x$ with probability distribution $P_x(\cdot\,; \theta)$ which may itself depend on $\theta$, i.e.

$$U(\theta) = \int g(x, \theta)\, P_x(dx; \theta). \qquad (2)$$

In such problems it is usually not possible to compute $U(\theta)$ directly. In stochastic optimization [30]-[32], [34]-[36], [40], it is typically assumed that one can obtain independent samples of $x$ for a given $\theta$, hence obtain sample values of $g(x, \theta)$, and thus construct a Monte Carlo estimate of $U(\theta)$. In some applications it might not be possible or efficient to obtain independent samples of $x$. In this case one has to resort to other Monte Carlo strategies to approximate $U(\theta)$ such as, for example, importance sampling [12]. The Bayesian experimental design of clinical trials is an important application area where expected-value criteria arise [41]. We investigate the optimization of expected-value criteria motivated by problems of aircraft routing [42] and parameter identification for genetic networks [43]. In the particular case that $P_x(dx; \theta)$ does not depend on $\theta$, the 'inf' counterpart of problem (1) is called "empirical risk minimization", and is studied extensively in statistical learning theory [21], [22]. Conditions on $g$ and $P_x$ to ensure that $U$ is Lipschitz continuous (for Assumption 2) can be found in [31, pp. 189-190]. The results reported here apply in the same way to the optimization of both deterministic and expected-value criteria.

We introduce two different definitions of approximate solution to the optimization problem (1). The first is the definition of approximate domain optimizer. It will be essential in establishing finite-time guarantees on the performance of MCMC methods.

Definition 1: Let $\epsilon \ge 0$ and $\alpha \in [0, 1]$ be given numbers. Then $\theta$ is an approximate domain optimizer of $U$ with value imprecision $\epsilon$ and residual domain $\alpha$ if

$$\lambda(\{\theta' \in \Theta : U(\theta') > U(\theta) + \epsilon\}) \le \alpha\, \lambda(\Theta) \qquad (3)$$

where $\lambda$ denotes the Lebesgue measure.

That is, the function $U$ takes values strictly greater than $U(\theta) + \epsilon$ only on a subset of values of $\theta$ no larger than an $\alpha$ portion of the optimization domain. The smaller $\epsilon$ and $\alpha$ are, the better is the approximation of a true global optimizer. If both $\alpha$ and $\epsilon$ are equal to zero then $U(\theta)$ coincides with the essential supremum of $U$ [44]. We will use

$$\Theta(\epsilon, \alpha) := \{\theta \in \Theta \mid \lambda(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \epsilon\}) \le \alpha\, \lambda(\Theta)\}$$

to denote the set of approximate domain optimizers with value imprecision $\epsilon$ and residual domain $\alpha$. The intuition that our notion of approximate domain optimizer can be used to obtain formal guarantees on the finite-time performance of optimization methods based on a stochastic search of the domain is already apparent in the work of Vidyasagar. Vidyasagar [20], [22] introduces the similar definition of "probably approximate near minimum" and obtains rigorous finite-time guarantees in the optimization of expected-value criteria based on uniform independent sampling of the domain. The method of Vidyasagar has had considerable success in solving difficult control system design applications [20], [23]. Its appeal stems from its rigorous finite-time guarantees which exist without the need for any particular assumption on the optimization criterion.

The following is a more common notion of approximate optimizer.

Definition 2: Let $\epsilon \ge 0$ be a given number. Then $\theta$ is an approximate value optimizer of $U$ with imprecision $\epsilon$ if $U(\theta') \le U(\theta) + \epsilon$ for all $\theta' \in \Theta$.

This notion is commonly used in the stochastic programming literature [30]-[32], [40] and provides a direct bound on $U^*$: $\theta \in \Theta$ is an approximate value optimizer with imprecision $\epsilon \ge 0$ if and only if $U^* \le U(\theta) + \epsilon$. We will use

$$\Theta^*(\epsilon) := \{\theta \in \Theta \mid \forall \theta' \in \Theta,\ U(\theta') \le U(\theta) + \epsilon\}$$

to denote the set of approximate value optimizers with imprecision $\epsilon$.

It is easy to see that for all $\epsilon$ if $\Theta^* \ne \emptyset$ then $\Theta^* \subseteq \Theta^*(\epsilon)$. Notice that $\Theta^*(\epsilon)$ does not coincide with $\Theta(\epsilon, \alpha)$. In fact one can see that approximate value optimality is a stronger concept than approximate domain optimality, in the following sense. For all $\epsilon$ and all $\alpha$, if $\Theta^*(\epsilon) \ne \emptyset$ then $\Theta^*(\epsilon) \subseteq \Theta(\epsilon, \alpha)$. Conversely, given an approximate domain optimizer it is in general not possible to draw any conclusions about the approximate value optimizers. For example, for any $\alpha$ the function $U : [0, 1] \to [0, 1]$ with

$$U(\theta) = \begin{cases} 1 & \text{if } \theta = 0 \\ 0 & \text{otherwise} \end{cases}$$

has the property that $\Theta(\epsilon, \alpha) = \Theta$ for all $\epsilon \ge 0$. Therefore, given $\theta \in \Theta(\epsilon, \alpha)$ it is impossible to draw any conclusions about $U^*$; the only possible bound is $U^* \le U(\theta) + 1$ which, given that $U(\theta) \in [0, 1]$, is meaningless. A relation between domain and value approximate optimality can, however, be established under Assumption 2.
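The difference between the two definitions can be checked numerically on a toy case. The sketch below is only an illustration under stated assumptions: the domain is taken to be [0, 1], the Lebesgue measure is approximated by a midpoint grid, and the criterion `ramp` and all function names are hypothetical, not taken from the paper.

```python
def exceed_fraction(U, theta, eps, n_cells=1000):
    """Approximate lambda({U > U(theta) + eps}) / lambda(Theta) on Theta = [0, 1]."""
    level = U(theta) + eps
    hits = sum(1 for i in range(n_cells) if U((i + 0.5) / n_cells) > level)
    return hits / n_cells


def is_domain_optimizer(U, theta, eps, alpha, n_cells=1000):
    """Grid version of Definition 1 (approximate domain optimizer)."""
    return exceed_fraction(U, theta, eps, n_cells) <= alpha


def is_value_optimizer(U, theta, eps, n_cells=1000):
    """Grid version of Definition 2; the supremum is taken over the grid only."""
    grid_sup = max(U((i + 0.5) / n_cells) for i in range(n_cells))
    return grid_sup <= U(theta) + eps


def ramp(theta):
    """Hypothetical criterion U(theta) = theta on [0, 1], so U* = 1."""
    return theta


# theta = 0.85 is a domain optimizer for eps = alpha = 0.1,
# but not a value optimizer with imprecision 0.1.
print(is_domain_optimizer(ramp, 0.85, 0.1, 0.1), is_value_optimizer(ramp, 0.85, 0.1))
```

For this criterion the set of domain optimizers is the interval from 1 - eps - alpha to 1, while the set of value optimizers is the shorter interval from 1 - eps to 1, which matches the inclusion stated above.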

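Theorem 1: Let Assumption 2 hold. Let $\theta$ be an approximate domain optimizer with value imprecision $\epsilon$ and residual domain $\alpha$. Then, $\theta$ is also an approximate value optimizer with imprecision

$$\epsilon + \frac{L}{\sqrt{\pi}} \left[\frac{n}{2}\, \Gamma\!\left(\frac{n}{2}\right)\right]^{\frac{1}{n}} \left[\alpha\, \lambda(\Theta)\right]^{\frac{1}{n}}$$

where $\Gamma$ denotes the gamma function.

Theorem 1 shows that

$$\theta \in \Theta(\epsilon, \alpha) \;\Rightarrow\; U^* \le U(\theta) + \epsilon + \frac{L}{\sqrt{\pi}} \left[\frac{n}{2}\, \Gamma\!\left(\frac{n}{2}\right)\right]^{\frac{1}{n}} \left[\alpha\, \lambda(\Theta)\right]^{\frac{1}{n}}. \qquad (4)$$

The result allows us to select the value of $\alpha$ in such a way that an approximate domain optimizer with value imprecision $\epsilon$ and residual domain $\alpha$ is also an approximate value optimizer with imprecision $2\epsilon$. To do this, we need to select $\alpha$ so that $\frac{L}{\sqrt{\pi}} \left[\frac{n}{2} \Gamma\!\left(\frac{n}{2}\right)\right]^{\frac{1}{n}} \left[\alpha \lambda(\Theta)\right]^{\frac{1}{n}} \le \epsilon$, hence

$$\alpha \le \frac{\pi^{\frac{n}{2}}}{\lambda(\Theta)} \left[\frac{n}{2}\, \Gamma\!\left(\frac{n}{2}\right)\right]^{-1} \left[\frac{\epsilon}{L}\right]^{n}. \qquad (5)$$

To illustrate the above inequality consider the case where the domain $\Theta$ is contained in an $n$-dimensional ball of radius $R$. Notice that under Assumption 2 the existence of such an $R$ is guaranteed. In this case $\lambda(\Theta) \le \pi^{\frac{n}{2}} \left[\frac{n}{2} \Gamma\!\left(\frac{n}{2}\right)\right]^{-1} R^n$. Therefore (5) is satisfied whenever

$$\alpha \le \left[\frac{\epsilon}{L R}\right]^{n}. \qquad (6)$$

Note that, as $n$ increases, $\alpha$ has to decrease to zero rapidly to ensure the required imprecision of the approximate value optimizer. In this case, $\alpha$ needs to decrease to zero as $\epsilon^n$.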
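To get a feeling for how quickly the admissible residual domain shrinks with the dimension, inequalities (5) and (6) can be evaluated directly. The following is a minimal sketch; the function names and the numerical values are illustrative assumptions, and (5) is evaluated in log form only to avoid overflow of the gamma function for large n.

```python
import math


def ball_volume(radius, n):
    """Lebesgue measure of an n-dimensional ball: pi^(n/2) R^n / ((n/2) Gamma(n/2))."""
    return math.pi ** (n / 2) * radius ** n / ((n / 2) * math.gamma(n / 2))


def max_residual_domain(eps, lipschitz, n, volume):
    """Right-hand side of (5): largest alpha for which the extra imprecision is at most eps."""
    log_alpha = (
        (n / 2) * math.log(math.pi)
        - math.log(volume)
        - math.log(n / 2)
        - math.lgamma(n / 2)
        + n * math.log(eps / lipschitz)
    )
    return math.exp(log_alpha)


def max_residual_domain_ball(eps, lipschitz, radius, n):
    """Right-hand side of (6), valid when the domain lies inside a ball of radius R."""
    return (eps / (lipschitz * radius)) ** n


# When the domain is exactly the ball, (5) and (6) give the same value.
for n in (1, 2, 5, 10):
    alpha = max_residual_domain(0.1, 1.5, n, ball_volume(2.0, n))
    print(n, alpha, max_residual_domain_ball(0.1, 1.5, 2.0, n))
```

With these illustrative values the admissible alpha drops by a factor of 30 for every added dimension, which is the epsilon-to-the-n decay noted above.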



III. Optimization with MCMC: finite time guarantees 

In simulated annealing, a random search based on the Metropolis-Hastings algorithm is carried out, such that the distribution of the elements of the domain visited during the search converges to an equilibrium distribution concentrated around the global optimizers.

Here we adopt equilibrium distributions defined by densities proportional to $[U(\theta) + \delta]^J$, where $J$ and $\delta$ are strictly positive parameters. We use

$$\pi(d\theta; J, \delta) \propto [U(\theta) + \delta]^J\, \lambda(d\theta) \qquad (7)$$

to denote this equilibrium distribution. The presence of $\delta$ is a technical condition required in the proof of our main result and will be discussed later on in this section. In our setting, the so-called 'zero-temperature' distribution is the limiting distribution $\pi(\cdot\,; J, \delta)$ for $J \to \infty$, denoted by $\pi_\infty$. It can be shown that under some technical conditions, $\pi_\infty$ is a uniform distribution on the set $\Theta^*$ of the global maximizers of $U$ [45].

Algorithm I: MCMC for deterministic criteria

Assume that the current state of the chain is $\theta_k$.

1 Generate a proposed state $\tilde\theta_{k+1}$ according to $q_{\tilde\theta}(\tilde\theta \mid \theta_k)$.

2 Calculate the acceptance probability

$$\rho = \min\left\{ \frac{q_{\tilde\theta}(\theta_k \mid \tilde\theta_{k+1})}{q_{\tilde\theta}(\tilde\theta_{k+1} \mid \theta_k)} \cdot \frac{[U(\tilde\theta_{k+1}) + \delta]^J}{[U(\theta_k) + \delta]^J},\ 1 \right\}.$$

3 With probability $\rho$, accept the proposed state and set $\theta_{k+1} = \tilde\theta_{k+1}$. Otherwise leave the current state unchanged, i.e. set $\theta_{k+1} = \theta_k$.

Algorithm II: MCMC for expected-value criteria

Assume that the current state of the chain is $[\theta_k, \{x_k^{(j)} \mid j = 1, \dots, J\}]$ where $\{x_k^{(j)} \mid j = 1, \dots, J\}$ are $J$ independent extractions generated according to $P_x(dx; \theta_k)$.

1 Propose a new state $[\tilde\theta_{k+1}, \{\tilde x_{k+1}^{(j)} \mid j = 1, \dots, J\}]$ where $\tilde\theta_{k+1}$ is generated according to $q_{\tilde\theta}(\tilde\theta \mid \theta_k)$ and $\{\tilde x_{k+1}^{(j)} \mid j = 1, \dots, J\}$ are $J$ independent extractions generated according to $P_x(dx; \tilde\theta_{k+1})$.

2 Calculate the acceptance probability

$$\rho = \min\left\{ \frac{q_{\tilde\theta}(\theta_k \mid \tilde\theta_{k+1}) \prod_{j=1}^{J} [g(\tilde x_{k+1}^{(j)}, \tilde\theta_{k+1}) + \delta]}{q_{\tilde\theta}(\tilde\theta_{k+1} \mid \theta_k) \prod_{j=1}^{J} [g(x_k^{(j)}, \theta_k) + \delta]},\ 1 \right\}.$$

3 With probability $\rho$, accept the proposed state and set $\theta_{k+1} = \tilde\theta_{k+1}$ and $\{x_{k+1}^{(j)} = \tilde x_{k+1}^{(j)} \mid j = 1, \dots, J\}$. Otherwise leave the current state unchanged, i.e. set $\theta_{k+1} = \theta_k$ and $\{x_{k+1}^{(j)} = x_k^{(j)} \mid j = 1, \dots, J\}$.

Fig. 1. The basic iterations of the Metropolis-Hastings algorithm with equilibrium distributions $\pi(\cdot\,; J, \delta)$ for the maximization of deterministic and expected-value criteria. In both algorithms, $q_{\tilde\theta}(\cdot \mid \theta_k)$ is the density of the 'proposal distribution'.
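The iteration of Algorithm I can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes an independent uniform proposal over a box (so the proposal densities cancel in the acceptance ratio), works with logarithms of $[U(\theta) + \delta]^J$ to avoid overflow for large $J$, and uses a hypothetical one-dimensional criterion `tent`.

```python
import math
import random


def metropolis_hastings_uniform(U, bounds, J, delta, n_steps, rng):
    """Algorithm I with an independent uniform proposal on the box `bounds` (an assumption)."""
    theta = [rng.uniform(lo, hi) for lo, hi in bounds]
    log_w = J * math.log(U(theta) + delta)  # log of [U(theta) + delta]^J
    chain = []
    for _ in range(n_steps):
        proposal = [rng.uniform(lo, hi) for lo, hi in bounds]
        log_w_new = J * math.log(U(proposal) + delta)
        # Uniform independent proposal: the q-ratio equals 1, so only the
        # ratio of target densities remains in the acceptance probability.
        log_ratio = log_w_new - log_w
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta, log_w = proposal, log_w_new
        chain.append(theta)
    return chain


def tent(theta):
    """Hypothetical criterion on [0, 1] with values in [0, 1] and maximizer 0.3."""
    return 1.0 - abs(theta[0] - 0.3)


chain = metropolis_hastings_uniform(tent, [(0.0, 1.0)], J=50, delta=0.1,
                                    n_steps=2000, rng=random.Random(0))
print(chain[-1])
```

A state-dependent proposal would require restoring the ratio of proposal densities in `log_ratio`.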

In Fig. 1 we illustrate two algorithms which implement Markov transition kernels with equilibrium distributions $\pi(\cdot\,; J, \delta)$. Algorithm I is the 'classical' Metropolis-Hastings algorithm for the case in which $U$ is a deterministic criterion. Algorithm II is a suitably modified version of the Metropolis-Hastings algorithm for the case in which $U$ is an expected-value criterion in the form of (2). This latter algorithm was devised by Müller [34], [36] and Doucet et al. [35].
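A sketch of Algorithm II follows, again under assumptions that are not in the paper: an independent uniform proposal over a box, a noise model with $g \ge 0$ so that every factor $g + \delta$ is strictly positive, and hypothetical helper names. Only the logarithm of the product over the $J$ extractions is stored for the current state, which is all the acceptance ratio needs.

```python
import math
import random


def mcmc_expected_value(g, sample_x, bounds, J, delta, n_steps, rng):
    """Algorithm II with an independent uniform proposal; assumes g(x, theta) + delta > 0."""

    def propose():
        theta = [rng.uniform(lo, hi) for lo, hi in bounds]
        # log of prod_j [g(x_j, theta) + delta] with J fresh extractions of x
        log_w = sum(math.log(g(sample_x(theta, rng), theta) + delta) for _ in range(J))
        return theta, log_w

    theta, log_w = propose()
    chain = []
    for _ in range(n_steps):
        cand, cand_log_w = propose()
        log_ratio = cand_log_w - log_w  # q-ratio is 1 for the uniform proposal
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            theta, log_w = cand, cand_log_w
        chain.append(theta)
    return chain


def tent(theta):
    """Hypothetical expected-value criterion on [0, 1], maximizer 0.3."""
    return 1.0 - abs(theta[0] - 0.3)


def sample_x_demo(theta, rng):
    """Hypothetical noise source: x uniform on [0, 1], independent of theta."""
    return rng.random()


def g_demo(x, theta):
    """Bernoulli-type observation with E[g(x, theta)] = tent(theta)."""
    return 1.0 if x < tent(theta) else 0.0


chain = mcmc_expected_value(g_demo, sample_x_demo, [(0.0, 1.0)], J=20, delta=0.1,
                            n_steps=3000, rng=random.Random(1))
print(chain[-1])
```

Because the $J$ extractions are independent, the expectation of the product equals $[U(\theta) + \delta]^J$, so the marginal of the chain in $\theta$ targets $\pi(\cdot\,; J, \delta)$ even though $U$ itself is never evaluated.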

In the simulated annealing scheme, one would simulate an inhomogeneous chain in which the Markov transition kernel at the $k$-th step of the chain has equilibrium distribution $\pi(\cdot\,; J_k, \delta)$ where $\{J_k\}_{k=1,2,\dots}$ is a suitably chosen 'cooling schedule', i.e. a non-decreasing sequence of values for the exponent $J$. The rationale of simulated annealing is as follows: if the temperature is kept constant, say $J_k = J$, then the distribution of the state of the chain, say $P_{\theta_k}$, tends to the equilibrium distribution $\pi(\cdot\,; J, \delta)$; if $J \to \infty$ then the equilibrium distribution $\pi(\cdot\,; J, \delta)$ tends to the zero-temperature distribution $\pi_\infty$; as a result, if the cooling schedule $J_k$ tends to $\infty$, one obtains that the distribution of the state of the chain $P_{\theta_k}$ tends to $\pi_\infty$ [13]-[17].

The difficulty which must be overcome in order to obtain finite step results on simulated annealing algorithms on a continuous domain is that usually, in an optimization problem defined over continuous variables, the set of global optimizers $\Theta^*$ has zero Lebesgue measure (e.g. a set of isolated points). Notice that this is not the case for a finite domain, where the set of global optimizers is of non-null measure with respect to the reference counting measure [6]-[10]. It is instructive to look at the issue in terms of the rate of convergence to the target zero-temperature distribution. On a continuous domain, the standard distance between two distributions, say $\mu_1$ and $\mu_2$, is the total variation distance $\|\mu_1 - \mu_2\|_{TV} = \sup_{A \in \mathcal{B}(\Theta)} |\mu_1(A) - \mu_2(A)|$. If the set of global optimizers $\Theta^*$ has zero Lebesgue measure, then the target zero-temperature distribution $\pi_\infty$ ends up being a mixture of probability masses on $\Theta^*$. On the other hand, the distribution of the state of the chain $P_{\theta_k}$ is absolutely continuous with respect to the Lebesgue measure (i.e. $\lambda(A) = 0 \Rightarrow P_{\theta_k}(A) = 0$) by construction for any finite $k$. Hence, if $\Theta^*$ has zero Lebesgue measure then it has zero measure also according to $P_{\theta_k}$. The set $\Theta^*$ has however measure 1 according to $\pi_\infty$. The distance $\|P_{\theta_k} - \pi_\infty\|_{TV}$ is then constantly 1. In general, on a continuous domain, although the distribution of the state of the chain $P_{\theta_k}$ converges asymptotically to $\pi_\infty$, it is not possible to introduce a sensible distance between $P_{\theta_k}$ and $\pi_\infty$ and a rate of convergence to the target distribution cannot even be defined (weak convergence), see [13, Theorem 3.3].

Weak convergence to $\pi_\infty$ implies that, asymptotically, $\theta_k$ eventually hits the set of approximate value optimizers $\Theta^*(\epsilon)$, for any $\epsilon > 0$, with probability one [13]-[17]. In more recent works, bounds on the expected number of iterations before hitting $\Theta^*(\epsilon)$ [18] or on $P_{\theta_k}(\Theta^*(\epsilon))$ [19] have been obtained. In [19], a short review of existing bounds is proposed, and under some technical conditions, it is proven that for any $\epsilon > 0$ there is a number $C_\epsilon$ such that $P_{\theta_k}(\{\theta \in \Theta \mid U(\theta) \le U^* - \epsilon\}) \le C_\epsilon\, k^{-\frac{1}{3}} (1 + \log k)$. In general, the expressions in these bounds cannot be computed. For example, in the bound reported here, $C_\epsilon$ is not known in advance. Hence, existing bounds can be used to assess the asymptotic rate of convergence but not as stopping criteria.

Here we show that finite-time guarantees for stochastic optimization by MCMC methods on continuous domains can be obtained by selecting a distribution $\pi(\cdot\,; J, \delta)$ with a finite $J$ as the target distribution in place of the zero-temperature distribution $\pi_\infty$. Our definition of approximate domain optimizer given in Section II is essential for establishing this result. The definition of approximate domain optimizers carries an important property, which holds regardless of what the criterion $U$ is: if $\epsilon$ and $\alpha$ have non-zero values then the set of approximate global optimizers $\Theta(\epsilon, \alpha)$ always has non-zero Lebesgue measure. The following theorem establishes a lower bound on the measure of the set $\Theta(\epsilon, \alpha)$ with respect to a distribution $\pi(\cdot\,; J, \delta)$ with finite $J$. It is important to stress that the result holds universally for any optimization criterion $U$ on a bounded domain. The only minor requirement is that $U$ takes values in $[0, 1]$.

Theorem 2: Let Assumption 1 hold. Let $\Theta(\epsilon, \alpha)$ be the set of approximate domain optimizers of $U$ with value imprecision $\epsilon$ and residual domain $\alpha$. Let $J \ge 1$ and $\delta > 0$, and consider the distribution $\pi(d\theta; J, \delta) \propto [U(\theta) + \delta]^J \lambda(d\theta)$. Then, for any $\alpha \in (0, 1]$ and $\epsilon \in [0, 1]$, the following inequality holds

$$\pi(\Theta(\epsilon, \alpha); J, \delta) \ge \frac{1}{1 + \left[\dfrac{1 + \delta}{\epsilon + 1 + \delta}\right]^{J} \left[\dfrac{1}{\alpha}\, \dfrac{1 + \delta}{\epsilon + \delta} - 1\right] \dfrac{1 + \delta}{\delta}}. \qquad (8)$$

Notice that, for given non-zero values of $\epsilon$, $\alpha$, and $\delta$ the right-hand side of (8) can be made arbitrarily close to 1 by choice of $J$. To obtain some insight on this choice it is instructive to turn the bound of Theorem 2 around to provide a lower bound on $J$ which ensures that $\pi(\Theta(\epsilon, \alpha); J, \delta)$ attains some desired value $\sigma$.
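The lower bound of Theorem 2 is easy to evaluate numerically. The sketch below assumes the form of (8) as a function of $\epsilon$, $\alpha$, $\delta$ and $J$; the function name and the parameter values are illustrative choices.

```python
import math


def domain_optimizer_mass_bound(eps, alpha, delta, J):
    """Right-hand side of (8): lower bound on pi(Theta(eps, alpha); J, delta)."""
    # [(1 + delta) / (eps + 1 + delta)]^J, computed through its logarithm
    decay = math.exp(J * math.log((1 + delta) / (eps + 1 + delta)))
    bracket = (1 / alpha) * (1 + delta) / (eps + delta) - 1
    return 1 / (1 + decay * bracket * (1 + delta) / delta)


# The bound approaches 1 as the exponent J grows.
for J in (10, 100, 1000, 1540):
    print(J, domain_optimizer_mass_bound(0.01, 0.01, 0.15, J))
```

For these illustrative values the bound is still small at $J = 100$ and exceeds 0.99 at $J = 1540$.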

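Corollary 3: Let the notation and assumptions of Theorem 2 hold. For any $\alpha \in (0, 1]$, $\epsilon \in (0, 1]$ and $\sigma \in (0, 1)$, if

$$J \ge \frac{1 + \epsilon + \delta}{\epsilon} \left[ \log\frac{\sigma}{1 - \sigma} + \log\frac{1}{\alpha} + 2 \log\frac{1 + \delta}{\delta} \right] \qquad (9)$$

then $\pi(\Theta(\epsilon, \alpha); J, \delta) \ge \sigma$.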
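The exponent required by Corollary 3 can be computed as follows. This is a small sketch assuming the form of (9) with the prefactor $(1 + \epsilon + \delta)/\epsilon$; the function name is hypothetical, and the parameter values are those of the numerical illustration given later in this section.

```python
import math


def min_exponent(eps, alpha, delta, sigma):
    """Right-hand side of (9): a value of J sufficient for pi(Theta(eps, alpha); J, delta) >= sigma."""
    return (1 + eps + delta) / eps * (
        math.log(sigma / (1 - sigma))
        + math.log(1 / alpha)
        + 2 * math.log((1 + delta) / delta)
    )


# eps = 0.01, alpha = 0.01, sigma = 0.99, delta = 0.15
print(math.ceil(min_exponent(0.01, 0.01, 0.15, 0.99)))
```

The required $J$ scales like $1/\epsilon$ and only logarithmically in $1/\alpha$ and $1/(1 - \sigma)$.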

The importance of the choice of a target distribution $\pi(\cdot\,; J, \delta)$ with a finite $J$ is that the distance $\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$ is a meaningful quantity. Convergence of the Metropolis-Hastings algorithm and MCMC methods in total variation distance is a well studied problem. The theory provides simple conditions under which one derives upper bounds on $\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$ that decrease to zero as $k \to \infty$ [26]-[29]. It is then appropriate to introduce the following finite-time result.

Proposition 4: Let the notation and assumptions of Theorem 2 hold. Assume that $J$ respects the bound of Corollary 3 for given $\alpha$, $\epsilon$, $\delta$ and $\sigma$. Let $\theta_k$ with distribution $P_{\theta_k}$ be the state of the chain of an MCMC algorithm with target distribution $\pi(\cdot\,; J, \delta)$. Then,

$$P_{\theta_k}(\Theta(\epsilon, \alpha); J, \delta) \ge \sigma - \|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}.$$

In other words, the statement "$\theta_k$ is an approximate domain optimizer of $U$ with value imprecision $\epsilon$ and residual domain $\alpha$" can be made with confidence $\sigma - \|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$.

The proof follows directly from the definition of the total variation distance.

If the optimization criterion is Lipschitz continuous, Theorem 2 can be used together with Theorem 1 to derive a lower bound on the measure of the set of approximate value optimizers with a given imprecision with respect to a distribution $\pi(\cdot\,; J, \delta)$. An example of such a bound is the following.

Proposition 5: Let the notation and assumptions of Theorems 1 and 2 hold. In addition, assume that $\Theta$ is contained in an $n$-dimensional ball of radius $R$. Let $\theta_k$ with distribution $P_{\theta_k}$ be the state of the chain of an MCMC algorithm with target distribution $\pi(\cdot\,; J, \delta)$. For given $\epsilon \in (0, 1]$ and $\sigma \in (0, 1)$, if

$$J \ge \frac{1 + \epsilon + \delta}{\epsilon} \left[ \log\frac{\sigma}{1 - \sigma} + n \log\left(\frac{L R}{\epsilon}\right) + 2 \log\frac{1 + \delta}{\delta} \right] \qquad (10)$$

then

$$P_{\theta_k}(\Theta^*(2\epsilon); J, \delta) \ge \sigma - \|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}.$$

In other words, the statement "$\theta_k$ is an approximate value optimizer of $U$ with value imprecision $2\epsilon$" can be made with confidence $\sigma - \|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$.
The proof follows by substituting $\alpha$ with the right-hand side of (6) in (9) and from the definition of the total variation distance.
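As a worked step, and assuming that $\alpha$ is set equal to the right-hand side of (6), the $\log(1/\alpha)$ term of (9) turns into the dimension-dependent term of (10):

```latex
\alpha = \left(\frac{\epsilon}{L R}\right)^{n}
\;\Longrightarrow\;
\log\frac{1}{\alpha} = n \log\frac{L R}{\epsilon},
\qquad\text{so that (9) reads}\qquad
J \ge \frac{1 + \epsilon + \delta}{\epsilon}
\left[ \log\frac{\sigma}{1 - \sigma} + n \log\frac{L R}{\epsilon} + 2 \log\frac{1 + \delta}{\delta} \right].
```

With this choice of $\alpha$, Theorem 1 gives $\Theta(\epsilon, \alpha) \subseteq \Theta^*(2\epsilon)$, so the lower bound on the measure of $\Theta(\epsilon, \alpha)$ carries over to $\Theta^*(2\epsilon)$.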

Finally, Theorem 2 provides a criterion for selecting the parameter $\delta$ in $\pi(\cdot\,; J, \delta)$. For given $\epsilon$ and $\alpha$, there exists an optimal choice of $\delta$ which minimizes the value of $J$ required to ensure $\pi(\Theta(\epsilon, \alpha); J, \delta) \ge \sigma$. The advantage of choosing the smallest $J$, consistent with the required $\sigma$, is computational. The exponent $J$ coincides with the number of Monte Carlo simulations of the random variable $x$ which must be done at each step in Algorithm II. The smallest $J$ also reduces the peakedness of $\pi(\cdot\,; J, \delta)$. The higher the peakedness of $\pi(\cdot\,; J, \delta)$ is, the harder it is to design a proposal distribution which operates efficiently. In turn, reducing the peakedness of $\pi(\cdot\,; J, \delta)$ will decrease the number of steps required to achieve the desired reduction of $\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$. The optimal choice of $\delta$ is specified by the following result.

Proposition 6: For fixed $\epsilon > 0$, $\alpha > 0$, and $\sigma \in (0.5, 1)$, the function

$$\frac{1 + \epsilon + \delta}{\epsilon} \left[ \log\frac{\sigma}{1 - \sigma} + \log\frac{1}{\alpha} + 2 \log\frac{1 + \delta}{\delta} \right],$$

i.e. the right-hand side of inequality (9), is convex in $\delta$ and attains its global minimum at the unique solution (for $\delta$) of the equation

$$\log\frac{1 + \delta}{\delta} + \log\sqrt{\frac{\sigma}{1 - \sigma}} + \log\frac{1}{\sqrt{\alpha}} = \frac{1 + \epsilon + \delta}{\delta (1 + \delta)}.$$

For example, if $\epsilon = 0.01$, $\alpha = 0.01$ and $\sigma = 0.99$, then one obtains $\delta = 0.15$ and $J = 1540$. Plots of the value of the optimal $\delta$ and of the corresponding value of $J$ for different values of $\epsilon$, $\alpha$ and $\sigma$ are shown in Fig. 2. Notice that the result of Proposition 6 holds also for inequality (10) provided that $\alpha$ in the statement of Proposition 6 is replaced by the right-hand side of (6).
IV. Convergence 

In this section we illustrate the statement of Proposition 4. We base the discussion on the simplest available result on the convergence of MCMC methods in total variation distance, taken from [28]. In this case, the proposal distribution, denoted by its density $q_{\tilde\theta}(\tilde\theta \mid \theta_k)$ in Algorithms I and II, is independent of the current state $\theta_k$.

Theorem 7 ([28]): Let $P_{\theta_k}$ be the distribution of the state of the chain in the Metropolis-Hastings algorithm with an independent proposal distribution. Let $\pi$ denote the target distribution. Let $p$ and $q$ denote respectively the density of $\pi$ and the density of the proposal distribution and assume that $p(\theta) > 0$, $\forall \theta \in \Theta$ and $q(\theta) > 0$, $\forall \theta \in \Theta$. If there exists $M$ such that $p(\theta) \le M q(\theta)$, $\forall \theta \in \Theta$, then

$$\|\pi - P_{\theta_k}\|_{TV} \le \left(1 - \frac{1}{M}\right)^{k}. \qquad (11)$$

Proof: See [28, Theorem 2.1], or [12, Theorem 7.8].

Here, we choose $q$ as the uniform distribution over $\Theta$. Sampling using an independent uniform proposal distribution is a naive strategy in an MCMC approach and cannot be expected to perform efficiently [12]. However, it allows us to present some simple illustrative examples where convergence bounds can be derived with a few basic steps.

In some cases the naive strategy can produce approximate domain optimizers very efficiently. One such case occurs under the assumption that the optimization criterion $U(\theta)$ has a "flat top", i.e. the set of global optimizers $\Theta^*$ has non-zero Lebesgue measure. The same assumption has been used in [13, Theorem 4.2] to obtain the strong convergence of simulated annealing on a continuous domain. In this case, the application of Theorem 7 provides the following result.

Proposition 8: Let the notation and assumptions of Proposition 4 hold. In particular, assume that $\theta_k$ is the state of the chain of the Metropolis-Hastings algorithm with independent uniform proposal distribution. In addition, given $\rho \in (0, 1)$, let $\sigma = (1 + \gamma)\rho$ for some $\gamma \in (0, \frac{1 - \rho}{\rho})$. Let $\Theta^*$ be the set of global optimizers of $U$ and assume that $\lambda(\Theta^*) \ge \beta \lambda(\Theta)$ for some $\beta \in (0, 1)$. If

$$k \ge \frac{\log\frac{1}{\gamma\rho}}{-\log(1 - \beta)} \qquad (12)$$

then $P_{\theta_k}(\Theta(\epsilon, \alpha); J, \delta) \ge \rho$.

In (12), it is convenient to choose $\gamma \approx \frac{1 - \rho}{\rho}$. Hence, the number of iterations grows approximately as $-\log(1 - \rho) = \log\frac{1}{1 - \rho}$ and $-\frac{1}{\log(1 - \beta)}$, and is independent of $\epsilon$ and $\alpha$. In Algorithm II the total number of required samples of $x$ is given by the number of iterations multiplied by $J$. In this case, it can be shown that a nearly optimal choice is $\gamma = \frac{1 - \rho}{2\rho}$. Hence, using (9) for the case of approximate domain optimization, we obtain that the required samples of $x$ grow as $\frac{1}{\epsilon}$, $\log\frac{1}{\alpha}$, and approximately as $\left(\log\frac{1}{1 - \rho}\right)^2$. Instead, using (10) for the case of approximate value optimization, we obtain that the required samples of $x$ grow as $\frac{1}{\epsilon}\log\frac{1}{\epsilon}$, $\left(\log\frac{1}{1 - \rho}\right)^2$, $\log LR$ and $n$.
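The flat-top iteration count can be sketched as below. The sketch assumes that the number of steps is obtained by requiring $(1 - \beta)^k \le \gamma\rho$, which follows from Theorem 7 with $M \le 1/\beta$ for a uniform proposal; the function name and the parameter values are illustrative.

```python
import math


def flat_top_iterations(rho, gamma, beta):
    """Smallest integer k with (1 - beta)^k <= gamma * rho (assumed reading of the flat-top bound)."""
    if not 0 < rho < 1 or not 0 < beta < 1:
        raise ValueError("rho and beta must lie in (0, 1)")
    if not 0 < gamma < (1 - rho) / rho:
        raise ValueError("gamma must lie in (0, (1 - rho) / rho)")
    return math.ceil(math.log(1 / (gamma * rho)) / (-math.log(1 - beta)))


# Confidence 0.9 when the global optimizers cover 10% of the domain.
print(flat_top_iterations(rho=0.9, gamma=0.05, beta=0.1))
```

Note that neither $\epsilon$ nor $\alpha$ enters this count, in line with the remark above; they only affect the exponent $J$.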

If the 'flat top' condition is not met it can be easily seen that the use of a uniform proposal distribution can lead to an exponential number of iterations. The problem is the implicit dependence of the convergence rate on the exponent $J$. In the general case, by applying Theorem 7 we obtain the following result.

Proposition 9: Let the notation and assumptions of Proposition 4 hold. In particular, assume that $\theta_k$ is the state of the chain of the Metropolis-Hastings algorithm with independent uniform proposal distribution. In addition, given $\rho \in (0, 1)$, let $\sigma = (1 + \gamma)\rho$ for some $\gamma \in (0, \frac{1 - \rho}{\rho})$. If $k \ge \left(\frac{1 + \delta}{\delta}\right)^{J} \log\frac{1}{\gamma\rho}$ or, equivalently,

$$k \ge \left[ \frac{(1 + \gamma)\rho}{1 - (1 + \gamma)\rho}\, \frac{1}{\alpha} \left(\frac{1 + \delta}{\delta}\right)^{2} \right]^{\frac{1 + \epsilon + \delta}{\epsilon} \log\frac{1 + \delta}{\delta}} \log\frac{1}{\gamma\rho} \qquad (13)$$

then $P_{\theta_k}(\Theta(\epsilon, \alpha); J, \delta) \ge \rho$.

Hence, the number of iterations turns out to be exponential in $\frac{1}{\epsilon}$. In Algorithm II, the total number of required extractions of $x$ grows like $Jk$, which is also exponential in $\frac{1}{\epsilon}$. Therefore, using Theorem 7 for Algorithms I and II with $q$ as the independent uniform proposal distribution, the only general bounds that we can guarantee are exponential.
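The size of the general bound can be made concrete by evaluating its base-10 logarithm, since the bound itself overflows floating-point arithmetic for small $\epsilon$. This is a sketch assuming the first form of the condition in Proposition 9, with $J$ taken from (9); the function name and parameter values are hypothetical.

```python
import math


def log10_iteration_bound(eps, alpha, delta, rho, gamma):
    """log10 of ((1 + delta) / delta)^J * log(1 / (gamma * rho)), with J taken from (9)."""
    sigma = (1 + gamma) * rho
    if not 0 < sigma < 1:
        raise ValueError("(1 + gamma) * rho must lie in (0, 1)")
    J = (1 + eps + delta) / eps * (
        math.log(sigma / (1 - sigma)) + math.log(1 / alpha)
        + 2 * math.log((1 + delta) / delta)
    )
    log_k = J * math.log((1 + delta) / delta) + math.log(math.log(1 / (gamma * rho)))
    return log_k / math.log(10)


# Number of decimal digits of the iteration bound for two tolerances.
print(log10_iteration_bound(0.1, 0.01, 0.15, 0.9, 0.1))
print(log10_iteration_bound(0.01, 0.01, 0.15, 0.9, 0.1))
```

Reducing $\epsilon$ by a factor of ten multiplies the number of digits of the bound by roughly ten, which is the exponential dependence on $1/\epsilon$ stated above.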

V. Numerical example 

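To demonstrate some of the bounds derived in this work we apply the proposed method to a simple example. Let $\theta \in \Theta = [-3, 3] \times [-3, 3]$ and consider the function

$$v(\theta) = 3 (1 - \theta_1)^2 e^{-\theta_1^2 - (\theta_2 + 1)^2} - 10 \left(\frac{\theta_1}{5} - \theta_1^3 - \theta_2^5\right) e^{-\theta_1^2 - \theta_2^2} - \frac{1}{3} e^{-(\theta_1 + 1)^2 - \theta_2^2}$$

(the Matlab function peaks). We define the function $U : \Theta \to [0, 1]$ by

$$U(\theta) = \begin{cases} \dfrac{v(\theta)}{\max_{\theta' \in \Theta} v(\theta')} & \text{if } v(\theta) \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

The scaling factor $\max_{\theta' \in \Theta} v(\theta') = 8.1062$ and a Lipschitz constant of $U(\theta)$, $L = 1.725$, were computed numerically using a grid on $\Theta$. The function $U$ and its level sets are shown in Fig. 3. The 0.9 level set, which coincides with $\Theta^*(0.1)$, is highlighted in the figure.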

To obtain a stochastic programming problem multiplicative noise was added using the function

$$g(x, \theta) = (1 + x)\, U(\theta)$$

where $x$ is normally distributed with mean 0 and variance 0.25. It is easy to see that the expected value of $g(x, \theta)$ is indeed equal to $U(\theta)$. One can think of $g(x, \theta)$ as an imperfect, unbiased measurement device of $U(\theta)$: we can only collect information about $U$ through noise-corrupted samples generated by $g$. Notice that the noise intensity is higher in areas where $U(\theta)$ is large, making it more difficult to use the samples to pinpoint the maxima of $U$.
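The test criterion and its noisy measurement can be sketched as follows. Two points are assumptions of this sketch rather than statements of the paper: negative values of the peaks function are truncated to zero before scaling, and the scaling factor 8.1062 quoted in the text is hard-coded. The function names are hypothetical.

```python
import math
import random

V_MAX = 8.1062  # scaling factor quoted in the text


def peaks(t1, t2):
    """The Matlab 'peaks' surface."""
    return (3 * (1 - t1) ** 2 * math.exp(-t1 ** 2 - (t2 + 1) ** 2)
            - 10 * (t1 / 5 - t1 ** 3 - t2 ** 5) * math.exp(-t1 ** 2 - t2 ** 2)
            - math.exp(-(t1 + 1) ** 2 - t2 ** 2) / 3)


def U(theta):
    """Scaled criterion in [0, 1]; truncating negative values to zero is an assumption."""
    return min(max(peaks(theta[0], theta[1]), 0.0) / V_MAX, 1.0)


def g(x, theta):
    """Noisy unbiased measurement (1 + x) U(theta)."""
    return (1 + x) * U(theta)


def grid_max(cells=600):
    """Maximum of the peaks surface over a uniform grid on [-3, 3] x [-3, 3]."""
    best = -math.inf
    for i in range(cells + 1):
        t1 = -3 + 6 * i / cells
        for j in range(cells + 1):
            best = max(best, peaks(t1, -3 + 6 * j / cells))
    return best


rng = random.Random(0)
theta = (0.0, 1.58)
samples = [g(rng.gauss(0.0, 0.5), theta) for _ in range(20000)]  # variance 0.25
print(grid_max(), U(theta), sum(samples) / len(samples))
```

The sample mean of `g` stays close to `U(theta)`, while the spread of individual samples is proportional to `U(theta)`, which is the multiplicative-noise effect described above.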

The MCMC Algorithm II of Fig. 1 was applied to this function. The design parameter $\delta = 0.1$ and an independent uniform proposal distribution $q$ were used throughout.

Fig. 3. Function $U(\theta)$ (left panel) and its level sets (right panel). The 0.9 level set is highlighted as a dashed ellipse.

To demonstrate the convergence of the algorithm, 2,000 independent runs of the algorithm, of
10,000 steps each, were generated. We then computed the fraction of runs that found themselves
in $\Theta^*(0.1)$ at different time points; for simplicity we refer to this fraction as the 'success rate'.
The results for different values of $J$ are reported in the left panel of Fig. 4. It is clear that in all
cases the success rate quickly settles to a steady state value, suggesting that the algorithm has
converged. Moreover, the steady state success rate increases as $J$ increases. In the right panel
of Fig. 4 we concentrate on the case $J = 100$ and plot in a logarithmic scale the absolute value
of the difference between the success rate at different time points and the steady state success
rate. According to Theorem 7 one would expect this difference to decay geometrically at
a rate $1 - \frac{1}{M}$. For comparison purposes, the corresponding curve for the numerically estimated
value $M = 1475$ is also plotted on the figure. The bound of Theorem 7 indeed appears to be
valid, albeit, in this case, conservative.
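The 'success rate' experiment can be mimicked on a toy problem. The following is a minimal sketch under stated assumptions, not the algorithm or the test function used above: `U` is a hypothetical one-dimensional criterion on $[0, 1]$, the criterion is evaluated exactly (no noise), and the chain is a plain independence Metropolis-Hastings sampler with uniform proposal targeting a density proportional to $[U(\theta) + \delta]^J$.

```python
import random

DELTA = 0.1


def U(theta):
    """Hypothetical criterion on Theta = [0, 1]; maximum value 1 at theta = 0.5."""
    return 1.0 - 2.0 * abs(theta - 0.5)


def run_chain(J, steps, rng):
    """Return the last state of an independence Metropolis-Hastings chain
    whose target density is proportional to (U(theta) + DELTA) ** J."""
    theta = rng.random()
    weight = (U(theta) + DELTA) ** J
    for _ in range(steps):
        proposal = rng.random()  # independent uniform proposal on [0, 1]
        proposal_weight = (U(proposal) + DELTA) ** J
        # With a uniform proposal the acceptance probability is
        # min(1, proposal_weight / weight).
        if rng.random() * weight <= proposal_weight:
            theta, weight = proposal, proposal_weight
    return theta


def success_rate(J, runs=200, steps=500, seed=0):
    """Fraction of independent runs whose last state has U(theta) >= 0.9."""
    rng = random.Random(seed)
    hits = sum(U(run_chain(J, steps, rng)) >= 0.9 for _ in range(runs))
    return hits / runs


rate_low = success_rate(J=1)
rate_high = success_rate(J=50)
```

For this toy criterion the set $\{U \ge 0.9\}$ plays the role of $\Theta^*(0.1)$, and the estimated success rate is markedly higher for the larger exponent, in line with the qualitative behaviour described above.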

To demonstrate the bound of Proposition 5, the steady state success rate as a function of
the exponent $J$ is reported in Fig. 5; more precisely, the figure shows the decay of 1 minus
the steady state success rate as a function of $J$ in linear and logarithmic scales. The figure also
shows the corresponding theoretical bound based on Proposition 5. Once again the bound appears
to be valid. Finally, notice that although in this particular case the proposal distribution is an
independent uniform distribution, the resulting states of the chain are a sequence of dependent
samples. In Fig. 5 we show the success rate estimated using the last 2,000 states of a single
MCMC run of length 10,000 (instead of the last state of 2,000 independent MCMC runs of
length 10,000 each). It appears that the success rate increases much faster in this case. This
is due to the fact that the 2,000 samples used are now correlated. Figure 6 demonstrates this
through scatter plots of the location of the 2,000 states used to estimate the success rate for
$J = 100$ in the two cases. While for both the 2,000 independent runs and the single run most
of the states end up inside the set $\Theta^*(0.1)$ as expected, it is apparent that the chain only moves
three times in the last 2,000 steps of the single run; all other proposals are rejected. Plots for
the case $J = 10$ are also included. Note that in this case points near the second largest local
maximum are also occasionally accepted.

Fig. 4. Left panel: Success rate as a function of simulation step for $J = 1$ (dot-dash), $J = 10$ (dotted), $J = 100$ (solid) and
$J = 200$ (dashed). Right panel: Logarithmic plot of success rate for $J = 100$ (solid) with bound of Theorem 7 (dashed).

Fig. 5. Plot of the decay of 1 minus the success rate as a function of the exponent $J$ in linear (left panel) and logarithmic (right
panel) scales. Both plots show the empirical value based on the last state of 2,000 independent runs of 10,000 steps each (solid),
the empirical value based on the last 2,000 states of a single 10,000 step run (dotted), and the theoretical bound (dashed).
5,000 independent runs were used for the case $J = 300$, since the first 2,000 runs all ended up in $\Theta^*(0.1)$ at step 10,000.

Fig. 6. Location of the last state of 2,000 independent runs of 10,000 steps each (left column) and the last 2,000 steps
of a single 10,000 step run (right column) for $J = 100$ (top row) and $J = 10$ (bottom row). The set $\Theta^*(0.1)$ is plotted as a
dashed ellipse for comparison.



VI. Comparison with other approaches to stochastic optimization 

In this section we attempt a comparison of the computational features of the MCMC
approach with those of other state-of-the-art methods for solving the stochastic programming
problem with finite-time performance bounds. Other methods are typically formulated under
the assumption that the distribution $P_x(\cdot\,;\theta)$ does not depend on $\theta$. In this case, $U$ becomes

$$U(\theta) = \int g(x, \theta)\,P_x(dx). \tag{14}$$

We stress from the beginning that a direct comparison of the computational complexity of the
different methods is not possible at this stage, since the different methods rely on different
assumptions, e.g. some methods require solving an additional optimization problem. Moreover,
a satisfactory complexity analysis for the MCMC approach is not yet available. The comparison
focuses on the number of samples of $x$ required in each method to obtain an approximate value
optimizer with imprecision $2\epsilon$, with confidence $p$, in the optimization of (14). In Table I we
compare the growth rates of the total number of samples of $x$ required in each method to obtain
the desired optimization accuracy as a function of the parameters of the problem.

TABLE I: Growth rates of the number of samples of $x$ required to obtain an approximate value optimizer with imprecision $2\epsilon$ of $U$, given by
(14), with probability $p$. In the case of MCMC, the entries of the table represent the number of samples of $x$ required to perform one iteration
of the algorithm.

In the approach of Shapiro [30], [31], $N$ independent samples $x_1, \dots, x_N$, generated according
to $P_x$, are used to construct the approximate criterion

$$\hat{U}_S(\theta) = \frac{1}{N} \sum_{i=1}^{N} g(x_i, \theta). \tag{15}$$

It is shown in [30], [31] that if $\theta_S$ is an approximate value optimizer of $\hat{U}_S$ with imprecision $\epsilon$
then $\theta_S$ is also an approximate value optimizer of $U$ with imprecision $2\epsilon$ with probability at least
$p$, provided that $N$ is sufficiently high. The growth rates of the required $N$, reported in the first
line of Table I, are based on [31, equation (3.9)]. Notice that this is only a bound on the samples
required to construct $\hat{U}_S$. It is argued in [31] that the optimization of $\hat{U}_S$ within $\epsilon$ of optimality can
be carried out efficiently under convexity assumptions. Nesterov [32], [33] presents a specific
approach for convex stochastic problems. In this approach the samples generated according to
$P_x$ are used to construct an estimate of the optimizer of $U$ using a stochastic sub-gradient algorithm.
The growth rates of the number of samples required to obtain an approximate value optimizer
of $U$ with imprecision $2\epsilon$ with probability at least $p$ are reported in the second line of Table I
and are based on [33, equation (14)]. Finally, Vidyasagar [20], [22] proposed a fully randomized
algorithm which, as mentioned earlier, is closely related to the one presented in our work. In
Vidyasagar's approach one generates $N$ independent samples $\theta_1, \dots, \theta_N$ according to a 'search'
distribution $P_\theta$, which has support on $\Theta$, and $M$ independent samples $x_1, \dots, x_M$ according to
$P_x$, and sets

$$\theta_V := \arg\max_{i = 1, \dots, N} \frac{1}{M} \sum_{j=1}^{M} g(x_j, \theta_i). \tag{16}$$

Under minimal assumptions, close to our Assumption 1, it can be shown that if

$$N \ge \frac{\log \frac{2}{1-p}}{\log \frac{1}{1-\alpha}} \quad \text{and} \quad M \ge \frac{1}{2\epsilon^2} \log \frac{4N}{1-p} \tag{17}$$

then

$$P_\theta\left(\{\theta \in \Theta \mid U(\theta) > U(\theta_V) + \epsilon\}\right) \le \alpha \tag{18}$$

with probability at least $p$. It is shown in [20] that potentially tighter bounds can be obtained if
the family of functions $\{g(\cdot, \theta) \mid \theta \in \Theta\}$ has the UCEM property. Notice that (18) resembles
(3) in our definition of approximate domain optimizer. The difference is that the measure of
the set of points which are $\epsilon$ better than the candidate optimizer is taken with respect to $P_\theta$
in (18) as opposed to the Lebesgue measure in (3). If, and only if, $P_\theta$ is chosen to be the
uniform distribution over $\Theta$, then (18) becomes virtually equivalent to (3). In this case, we can
apply Theorem 1 and obtain the number of samples required to obtain an approximate value
optimizer. By substituting the resulting expression for $\alpha$ in (17) we obtain that if

$$N \ge \left(\frac{LR}{\epsilon}\right)^{n} \log \frac{2}{1-p} \quad \text{and} \quad M \ge \frac{1}{2\epsilon^2} \left[ \log \frac{4}{1-p} + \log\log \frac{2}{1-p} + n \log \frac{LR}{\epsilon} \right] \tag{19}$$

then $\theta_V$ is an approximate value optimizer of $U$ with imprecision $2\epsilon$ with probability at least $p$.
Notice that now the number of samples on $\Theta$ turns out to be exponential in $n$.
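To get a feel for the sample sizes involved, the bounds (17) can be evaluated numerically. The sketch below assumes that (17) has the standard form $N \ge \log\frac{2}{1-p} / \log\frac{1}{1-\alpha}$ and $M \ge \frac{1}{2\epsilon^2}\log\frac{4N}{1-p}$; the function name `vidyasagar_sample_sizes` and the numerical values are ours and purely illustrative.

```python
import math


def vidyasagar_sample_sizes(eps, alpha, p):
    """Smallest integers N, M satisfying the two inequalities assumed for (17).

    N: number of samples of theta drawn from the search distribution.
    M: number of samples of x used to evaluate each candidate theta.
    """
    N = math.ceil(math.log(2.0 / (1.0 - p)) / math.log(1.0 / (1.0 - alpha)))
    M = math.ceil(math.log(4.0 * N / (1.0 - p)) / (2.0 * eps ** 2))
    return N, M


N_coarse, M_coarse = vidyasagar_sample_sizes(eps=0.1, alpha=1e-2, p=0.99)
N_fine, M_fine = vidyasagar_sample_sizes(eps=0.1, alpha=1e-4, p=0.99)
```

Since $\log\frac{1}{1-\alpha} \approx \alpha$ for small $\alpha$, $N$ grows roughly like $1/\alpha$, which is why choosing $\alpha$ of the order of $(\epsilon/LR)^n$ makes the number of samples on $\Theta$ exponential in $n$.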

In the last row of the table, we have included the growth rates of (10), which is the number
of samples of $x$ which must be generated at each iteration of Algorithm II (which coincides
with the exponent $J$). In this case, the total number of required samples of $x$ is $J$ times the
number of iterations required to achieve the desired reduction of $\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$. Hence,
the entries of the last row represent a lower bound, or the 'base-line' growth rates, of the
total number of required samples in the MCMC approach. In this case, the confidence is $p =
\sigma - \|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV}$. Hence, since $\sigma \ge p$, it is sensible to consider the growth rate with
respect to $\sigma$ instead of $p$. By comparing the different entries in the table, we notice that (10)
grows slower than or at the same rate as the other bounds. Overall, the comparison reveals that in
principle there is scope for obtaining MCMC algorithms which, in terms of numbers of required
samples of $x$, have a computational cost comparable to those of the other approaches. Here
we present a preliminary complexity analysis which shows that the introduction of a cooling
schedule would eventually lead to efficient algorithms. Notice that in Section IV we considered
a constant schedule ($J_k = J$). Here, we assume that $J_k$ takes integer values starting with $J_1 = 1$
and ending with $J_k = J$, where $J$ is the smallest integer which satisfies either the bound of
Corollary 3 or (10). In the earliest works on simulated annealing, the logarithmic schedule of the
type $J_k = \lfloor \log k \rfloor + 1$ was often adopted [13], [14], [17]. Here, we are interested in counting the
total number of iterations required to complete the cooling schedule when $J$ is given by the bound
of Corollary 3 or by (10). Let $K_i$ denote the number of iterations in which $J_k = i$ for each
$i = 1, 2, \dots, J$. Hence, the total number of iterations is $\sum_{i=1}^{J} K_i$. In Algorithm II, the total
number of required samples of $x$ would be $\sum_{i=1}^{J} i K_i$. For the logarithmic schedule
$J_k = \lfloor \log k \rfloor + 1$ we have $K_i = \lfloor e^{i} \rfloor - \lfloor e^{i-1} \rfloor$. Hence
we obtain

$$\sum_{i=1}^{J} K_i = \sum_{i=1}^{J} \lfloor e^{i} \rfloor - \sum_{i=1}^{J} \lfloor e^{i-1} \rfloor = \lfloor e^{J} \rfloor - 1.$$

In this case the number of iterations turns out to be exponential in $J$. Hence, a logarithmic
schedule is not sufficient to obtain efficient algorithms.
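The exponential cost of the logarithmic schedule can be seen by direct enumeration. The following sketch (helper names are ours) counts, for $J_k = \lfloor \log k \rfloor + 1$, how many iterations are spent at each level $i \le J$ and accumulates the total number of iterations and of samples of $x$, assuming as above that $i$ samples are drawn per iteration at level $i$.

```python
import math


def log_schedule_level(k):
    """Exponent used at iteration k under the schedule floor(log k) + 1."""
    return math.floor(math.log(k)) + 1


def iterations_per_level(J):
    """Count by enumeration the iterations spent at each level 1..J."""
    counts = {i: 0 for i in range(1, J + 1)}
    k = 1
    while log_schedule_level(k) <= J:
        counts[log_schedule_level(k)] += 1
        k += 1
    return counts


counts = iterations_per_level(8)
total_iterations = sum(counts.values())
total_samples = sum(i * n for i, n in counts.items())

# Closed form quoted in the text, compared with the enumeration for i >= 2.
# (At i = 1 the enumeration counts k = 1 and k = 2.)
closed_form = {i: math.floor(math.exp(i)) - math.floor(math.exp(i - 1))
               for i in range(2, 9)}
```

Already for $J = 8$ the schedule needs a few thousand iterations, and each further level multiplies the count by roughly $e$.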

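In more recent works [15], [16], [19], the faster algebraic schedule of the type $J_k = \lfloor k^{a} \rfloor$,
with $a > 0$, has been considered. It is shown in [15], [16], [19] that the choice of the faster
algebraic schedule requires a sophisticated design of the proposal distribution. For the algebraic
schedule $J_k = \lfloor k^{a} \rfloor$ we have $K_i = \lfloor (i+1)^{1/a} \rfloor - \lfloor i^{1/a} \rfloor$. Hence we obtain

$$\sum_{i=1}^{J} K_i = \sum_{i=1}^{J} \left( \lfloor (i+1)^{1/a} \rfloor - \lfloor i^{1/a} \rfloor \right) = \lfloor (J+1)^{1/a} \rfloor - 1,$$

$$\sum_{i=1}^{J} i K_i = J \lfloor (J+1)^{1/a} \rfloor - \sum_{i=0}^{J-1} \lfloor (i+1)^{1/a} \rfloor .$$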

In this case the number of iterations grows as $J^{1/a}$. Hence, in the case of approximate domain
optimization, where $J$ is given by the bound of Corollary 3, the number of iterations grows as
$\left(\frac{1}{\epsilon}\right)^{1/a}$, $\left(\log\frac{1}{\alpha}\right)^{1/a}$ and $\left(\log\frac{1}{1-\sigma}\right)^{1/a}$. In the case of approximate value
optimization, where $J$ is given by (10), the number of iterations grows as
$\left(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\right)^{1/a}$, $\left(\log\frac{1}{1-\sigma}\right)^{1/a}$, $\left(\log LR\right)^{1/a}$ and $n^{1/a}$. In Algorithm II, the total number
of samples of $x$ is given by the number of iterations multiplied by $J$. Hence, in the case of
approximate domain optimization, it grows as $\left(\frac{1}{\epsilon}\right)^{1+1/a}$, $\left(\log\frac{1}{\alpha}\right)^{1+1/a}$ and
$\left(\log\frac{1}{1-\sigma}\right)^{1+1/a}$. In the case of approximate value optimization, it grows as
$\left(\frac{1}{\epsilon}\log\frac{1}{\epsilon}\right)^{1+1/a}$, $\left(\log\frac{1}{1-\sigma}\right)^{1+1/a}$, $\left(\log LR\right)^{1+1/a}$ and $n^{1+1/a}$
(notice that, as $a$ increases, the growth rates approach the entries of the last row of Table I).
Hence, an algebraic schedule leads to algorithms with polynomial growth rates. The convergence
analysis of Algorithms I and II with an additional algebraic cooling schedule goes beyond the
scope of this paper. Here, we limit ourselves to pointing out that the choice of a target distribution
$\pi(\cdot\,; J, \delta)$ with a finite $J$ implies that the cooling schedule $\{J_k\}_{k=1,2,\dots}$ can be chosen to be a
sequence that takes only a finite set of values. In turn, this fact should make the study of
convergence of $P_{\theta_k}$ to $\pi(\cdot\,; J, \delta)$ in total variation distance easier than the study of asymptotic
convergence of $P_{\theta_k}$ to the zero-temperature distribution $\pi_\infty$.
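The polynomial cost of the algebraic schedule can be illustrated in the same way. To keep the arithmetic exact, the sketch below restricts attention to $a = 1/m$ with $m$ a positive integer (our simplifying assumption), so that level $i$ is held exactly while $i^m \le k < (i+1)^m$. The function names are hypothetical.

```python
def algebraic_schedule_counts(J, m):
    """K_i for the schedule J_k = floor(k ** (1 / m)), m a positive integer.

    Level i is held while i**m <= k < (i + 1)**m, hence
    K_i = (i + 1)**m - i**m.
    """
    return [(i + 1) ** m - i ** m for i in range(1, J + 1)]


def totals(J, m):
    """Total iterations and total samples of x needed to complete level J,
    assuming i samples of x are drawn per iteration at level i."""
    counts = algebraic_schedule_counts(J, m)
    iterations = sum(counts)
    samples = sum(i * k_i for i, k_i in enumerate(counts, start=1))
    return iterations, samples


iterations, samples = totals(J=10, m=2)
```

With $m = 1/a$ the iteration count is of order $J^{1/a}$ and the sample count of order $J^{1+1/a}$, in contrast with the order $e^{J}$ of the logarithmic schedule.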

VII. Conclusions 

In this paper, we have introduced a novel approach for obtaining rigorous finite-time guarantees
on the performance of MCMC algorithms in the optimization of functions of continuous
variables. In particular we have established the values of the temperature parameter in
the target distribution which allow one to reach a solution, which is within the desired level
of approximation with the desired confidence, in a finite number of steps. Our work was
motivated by the MCMC algorithm (Algorithm II), introduced in [34]-[36], for solving stochastic
optimization problems. On the basis of our results, we were able to obtain the 'base-line'
computational complexity of the MCMC approach and to perform an initial assessment of the
computational complexity of MCMC algorithms. It has been shown that MCMC algorithms with
an algebraic cooling schedule would have polynomial complexity bounds comparable with those
of other state-of-the-art methods for solving stochastic optimization problems. Conditions for
asymptotic convergence of simulated annealing algorithms with an algebraic cooling schedule
have already been reported in the literature [15], [16], [19]. Our results enable novel research
on the development of efficient MCMC algorithms for the solution of stochastic programming
problems with rigorous finite-time guarantees. Finally, we would like to point out that the results
presented in this work apply not only to the MCMC approach but also to other
sampling methods which can implement the idea of simulated annealing [46], [47].

Appendix

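In order to prove Theorem 1 we first need to prove a preliminary technical result.

Lemma 10: For $A \subseteq \mathbb{R}^n$ and $\theta \in \mathbb{R}^n$ let $d(\theta, A) := \inf_{\theta' \in A} \|\theta - \theta'\|$. Then, for any $\beta > 0$,

$$d^*_\beta := \sup_{\substack{A \subseteq \mathbb{R}^n \\ \lambda(A) \le \beta}} \ \sup_{\theta \in A} d(\theta, A^c) = \frac{1}{\sqrt{\pi}} \left[ \frac{n}{2}\,\Gamma\!\left(\frac{n}{2}\right) \right]^{\frac{1}{n}} \beta^{\frac{1}{n}}$$

where $A^c$ denotes the complement of $A$ in $\mathbb{R}^n$.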

In the above Lemma the inner supremum determines the points in the set $A$ whose distance
from the complement of $A$ is the largest; loosely speaking, the points that lie the furthest from
the boundary of $A$ or the deepest in the interior of $A$. The outer supremum then maximizes this
distance over all sets $A$ whose Lebesgue measure is bounded by $\beta$.
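As a quick numerical illustration, and assuming the closed form of Lemma 10 is the radius of the Euclidean ball of Lebesgue measure $\beta$ in $\mathbb{R}^n$, namely $\frac{1}{\sqrt{\pi}}\left[\frac{n}{2}\Gamma\left(\frac{n}{2}\right)\beta\right]^{1/n}$, the quantity can be computed as follows. The function name `max_inner_distance` is ours.

```python
import math


def max_inner_distance(beta, n):
    """Radius of the Euclidean ball in R^n whose Lebesgue measure is beta."""
    return (1.0 / math.sqrt(math.pi)) * (
        (n / 2.0) * math.gamma(n / 2.0) * beta
    ) ** (1.0 / n)


# Sanity checks against familiar volumes of the unit ball.
r_interval = max_inner_distance(2.0, 1)              # interval of length 2
r_disc = max_inner_distance(math.pi, 2)              # disc of area pi
r_ball = max_inner_distance(4.0 * math.pi / 3.0, 3)  # ball of volume 4*pi/3
```

In each of the three cases the returned radius is 1 up to rounding, as expected for the unit ball in dimensions 1, 2 and 3.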

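Proof: We show that the optimizers for the outer supremum are 2-norm balls in $\mathbb{R}^n$; then, the
inner supremum is achieved by the center of the ball.
Consider any set $A \subseteq \mathbb{R}^n$ with $\lambda(A) \le \beta$ and let

$$d_A := \sup_{\theta \in A} d(\theta, A^c). \tag{20}$$

Since $\lambda(A) < \infty$, the supremum (20) is achieved by some $\theta_A \in A$. Without loss of generality we
can assume that the set $A$ is closed; if not, taking its closure will not affect its Lebesgue measure
and will lead to the same value for $d_A$. Let $B(\theta, d)$ denote the 2-norm ball with center in $\theta$ and
radius $d$. Notice that by construction $B(\theta_A, d_A) \subseteq A$ and therefore $\lambda(B(\theta_A, d_A)) \le \lambda(A) \le \beta$.
Moreover,

$$\sup_{\theta \in B(\theta_A, d_A)} d(\theta, B(\theta_A, d_A)^c) = d_A$$

achieved at the center, $\theta_A$, of the ball. In summary, for any $A \subseteq \mathbb{R}^n$ with $\lambda(A) \le \beta$ one can
find a 2-norm ball of measure at most $\beta$ which achieves $\sup_{\theta \in A} d(\theta, A^c)$.
Therefore,

$$\begin{aligned}
d^*_\beta &:= \sup_{\substack{A \subseteq \mathbb{R}^n \\ \lambda(A) \le \beta}} \ \sup_{\theta \in A} d(\theta, A^c) \\
&= \sup_{\substack{(\theta', r) \in \mathbb{R}^n \times \mathbb{R}_+ \\ \lambda(B(\theta', r)) \le \beta}} \ \sup_{\theta \in B(\theta', r)} d(\theta, B(\theta', r)^c) \\
&= \sup_{\lambda(B(0, r)) \le \beta} \ \sup_{\theta \in B(0, r)} d(\theta, B(0, r)^c) \\
&= \sup_{\substack{r \ge 0 \\ \lambda(B(0, r)) \le \beta}} r
 = \frac{1}{\sqrt{\pi}} \left[ \frac{n}{2}\,\Gamma\!\left(\frac{n}{2}\right) \right]^{\frac{1}{n}} \beta^{\frac{1}{n}} .
\end{aligned}$$

In the above derivation, the last equality is obtained by recalling that
$\lambda(B(\theta, r)) = \frac{2 \pi^{n/2}}{n\,\Gamma(n/2)}\, r^{n}$. ■

We are now in a position to prove Theorem 1.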
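Proof of Theorem 1: Let $\Theta(\theta, \epsilon) := \{\theta' \in \Theta \mid U(\theta') > U(\theta) + \epsilon\}$ and recall that by definition

$$\lambda(\Theta(\theta, \epsilon)) \le \alpha \lambda(\Theta).$$

Take any $\bar\theta \in \Theta$. Then either $U(\bar\theta) \le U(\theta) + \epsilon$ or $\bar\theta \in \Theta(\theta, \epsilon)$. In the former case there is nothing
to prove. In the latter case, according to Lemma 10 we have:

$$d(\bar\theta, \Theta(\theta, \epsilon)^c) \le \frac{1}{\sqrt{\pi}} \left[ \frac{n}{2}\,\Gamma\!\left(\frac{n}{2}\right) \right]^{\frac{1}{n}} [\alpha \lambda(\Theta)]^{\frac{1}{n}} .$$

Since $\Theta$ is compact and $U$ is continuous, the set $\Theta(\theta, \epsilon)^c$ is closed and therefore there exists
$\tilde\theta \in \Theta(\theta, \epsilon)^c$ such that

$$\|\bar\theta - \tilde\theta\| \le \frac{1}{\sqrt{\pi}} \left[ \frac{n}{2}\,\Gamma\!\left(\frac{n}{2}\right) \right]^{\frac{1}{n}} [\alpha \lambda(\Theta)]^{\frac{1}{n}} .$$

Moreover, since $U$ is Lipschitz, $|U(\bar\theta) - U(\tilde\theta)| \le L \|\bar\theta - \tilde\theta\|$. Since $\tilde\theta \in \Theta(\theta, \epsilon)^c$, we have that
$U(\tilde\theta) \le U(\theta) + \epsilon$, and therefore

$$U(\bar\theta) \le U(\theta) + \epsilon + \frac{L}{\sqrt{\pi}} \left[ \frac{n}{2}\,\Gamma\!\left(\frac{n}{2}\right) \right]^{\frac{1}{n}} [\alpha \lambda(\Theta)]^{\frac{1}{n}} . \tag{21}$$

The claim follows since $\bar\theta$ is arbitrary in $\Theta$ and satisfies either $U(\bar\theta) \le U(\theta) + \epsilon$ or (21). ■

Proof of Theorem 2: Let $\bar\alpha \in (0, 1]$ and $\rho \in (0, 1]$ be given numbers. To simplify the notation,
let $U_\delta(\theta) := U(\theta) + \delta$ and let $\pi_\delta$ be a normalized measure such that $\pi_\delta(d\theta) \propto U_\delta(\theta)\lambda(d\theta)$,
i.e. $\pi_\delta(d\theta) := \pi(d\theta; 1, \delta)$. In the first part of the proof we establish a lower bound on

$$\pi\left(\{\theta \in \Theta \mid \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha\}; J, \delta\right).$$

Let $y_{\bar\alpha} := \inf\{y \mid \pi_\delta(\{\theta \in \Theta \mid U_\delta(\theta) \le y\}) \ge 1 - \bar\alpha\}$. To start with we show that the set
$\{\theta \in \Theta \mid \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha\}$ coincides with $\{\theta \in \Theta \mid U_\delta(\theta) \ge \rho y_{\bar\alpha}\}$. Notice
that the quantity $\pi_\delta(\{\theta \in \Theta \mid U_\delta(\theta) \le y\})$ is a non-decreasing right-continuous function of $y$
because it has the form of a distribution function (see e.g. [48, p. 162], see also [22, Lemma
11.1]). Therefore we have $\pi_\delta(\{\theta \in \Theta \mid U_\delta(\theta) \le y_{\bar\alpha}\}) \ge 1 - \bar\alpha$ and

$$y \ge \rho y_{\bar\alpha} \ \Rightarrow\ \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') \le y\}) \ge 1 - \bar\alpha \ \Rightarrow\ \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > y\}) \le \bar\alpha .$$

Moreover,

$$y < \rho y_{\bar\alpha} \ \Rightarrow\ \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') \le y\}) < 1 - \bar\alpha \ \Rightarrow\ \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > y\}) > \bar\alpha$$

and taking the contrapositive one obtains

$$\pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > y\}) \le \bar\alpha \ \Rightarrow\ y \ge \rho y_{\bar\alpha} .$$

Therefore $\{\theta \in \Theta \mid U_\delta(\theta) \ge \rho y_{\bar\alpha}\} = \{\theta \in \Theta \mid \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha\}$.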

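We now derive a lower bound on $\pi(\{\theta \in \Theta \mid U_\delta(\theta) \ge \rho y_{\bar\alpha}\}; J, \delta)$. Let us introduce the notation
$A_{\bar\alpha} := \{\theta \in \Theta \mid U_\delta(\theta) < y_{\bar\alpha}\}$, $\bar A_{\bar\alpha} := \{\theta \in \Theta \mid U_\delta(\theta) \ge y_{\bar\alpha}\}$, $B_{\bar\alpha,\rho} := \{\theta \in \Theta \mid U_\delta(\theta) < \rho y_{\bar\alpha}\}$
and $\bar B_{\bar\alpha,\rho} := \{\theta \in \Theta \mid U_\delta(\theta) \ge \rho y_{\bar\alpha}\}$. Notice that $B_{\bar\alpha,\rho} \subseteq A_{\bar\alpha}$ and $\bar A_{\bar\alpha} \subseteq \bar B_{\bar\alpha,\rho}$. The quantity
$\pi_\delta(\{\theta \in \Theta \mid U_\delta(\theta) < y\})$ as a function of $y$ is the left-continuous version of
$\pi_\delta(\{\theta \in \Theta \mid U_\delta(\theta) \le y\})$ [48, p. 162]. Hence, the definition of $y_{\bar\alpha}$ implies $\pi_\delta(A_{\bar\alpha}) \le 1 - \bar\alpha$ and
$\pi_\delta(\bar A_{\bar\alpha}) \ge \bar\alpha$. Notice that

$$\pi_\delta(A_{\bar\alpha}) \le 1 - \bar\alpha \ \Rightarrow\ \frac{\delta \lambda(A_{\bar\alpha})}{\int_\Theta U_\delta(\theta)\lambda(d\theta)} \le 1 - \bar\alpha \quad \text{because } U(\theta) \ge 0 \ \forall\theta,$$

$$\pi_\delta(\bar A_{\bar\alpha}) \ge \bar\alpha \ \Rightarrow\ \frac{(1+\delta) \lambda(\bar A_{\bar\alpha})}{\int_\Theta U_\delta(\theta)\lambda(d\theta)} \ge \bar\alpha \quad \text{because } U(\theta) \le 1 \ \forall\theta.$$

Hence, $\lambda(\bar A_{\bar\alpha}) > 0$ and

$$\frac{\lambda(A_{\bar\alpha})}{\lambda(\bar A_{\bar\alpha})} \le \frac{1 - \bar\alpha}{\bar\alpha}\,\frac{1 + \delta}{\delta} .$$

Notice that $\lambda(\bar A_{\bar\alpha}) > 0$ implies $\lambda(\bar B_{\bar\alpha,\rho}) > 0$. We obtain

$$\begin{aligned}
\pi(\{\theta \in \Theta \mid U_\delta(\theta) \ge \rho y_{\bar\alpha}\}; J, \delta) &= \pi(\bar B_{\bar\alpha,\rho}; J, \delta)
 = \frac{\int_{\bar B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta)}{\int_{\Theta} U_\delta(\theta)^J \lambda(d\theta)} \\
&= \frac{\int_{\bar B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta)}{\int_{B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta) + \int_{\bar B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta)}
 = \frac{1}{1 + \dfrac{\int_{B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta)}{\int_{\bar B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta)}} \\
&\ge \frac{1}{1 + \dfrac{\int_{B_{\bar\alpha,\rho}} U_\delta(\theta)^J \lambda(d\theta)}{\int_{\bar A_{\bar\alpha}} U_\delta(\theta)^J \lambda(d\theta)}}
 \ge \frac{1}{1 + \dfrac{\rho^J y_{\bar\alpha}^J\, \lambda(B_{\bar\alpha,\rho})}{y_{\bar\alpha}^J\, \lambda(\bar A_{\bar\alpha})}} \\
&\ge \frac{1}{1 + \rho^J \dfrac{\lambda(A_{\bar\alpha})}{\lambda(\bar A_{\bar\alpha})}}
 \ge \frac{1}{1 + \rho^J\, \dfrac{1 - \bar\alpha}{\bar\alpha}\, \dfrac{1 + \delta}{\delta}} .
\end{aligned}$$

Since $\{\theta \in \Theta \mid U_\delta(\theta) \ge \rho y_{\bar\alpha}\} = \{\theta \in \Theta \mid \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha\}$ the first part
of the proof is complete.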

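In the second part of the proof we show that the set $\{\theta \in \Theta \mid \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') >
U_\delta(\theta)\}) \le \bar\alpha\}$ is contained in the set of approximate domain optimizers of $U$ with value
imprecision $\tilde\epsilon := (\rho^{-1} - 1)(1 + \delta)$ and residual domain $\tilde\alpha := \frac{1+\delta}{\tilde\epsilon + \delta}\,\bar\alpha$. Hence, we show that

$$\{\theta \in \Theta \mid \pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha\} \subseteq
\{\theta \in \Theta \mid \lambda(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}) \le \tilde\alpha \lambda(\Theta)\} .$$

We have

$$U(\theta') > U(\theta) + \tilde\epsilon \ \Leftrightarrow\ \rho U_\delta(\theta') > \rho[U_\delta(\theta) + \tilde\epsilon] \ \Rightarrow\ \rho U_\delta(\theta') > U_\delta(\theta)$$

which is proven by noticing that

$$\rho[U_\delta(\theta) + \tilde\epsilon] \ge U_\delta(\theta) \ \Leftrightarrow\ (1 - \rho) \ge U(\theta)(1 - \rho)$$

and $U(\theta) \in [0, 1]$. Hence,

$$\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\} \supseteq \{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\} .$$

Therefore,

$$\pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha \ \Rightarrow\ \pi_\delta(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}) \le \bar\alpha .$$

Let $Q_{\theta,\tilde\epsilon} := \{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}$ and notice that

$$\pi_\delta(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}) =
\frac{\int_{Q_{\theta,\tilde\epsilon}} U(\theta')\lambda(d\theta') + \delta \lambda(Q_{\theta,\tilde\epsilon})}{\int_{\Theta} U(\theta')\lambda(d\theta') + \delta \lambda(\Theta)} .$$

We obtain

$$\begin{aligned}
\pi_\delta(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}) \le \bar\alpha
 &\ \Rightarrow\ \tilde\epsilon \lambda(Q_{\theta,\tilde\epsilon}) + \delta \lambda(Q_{\theta,\tilde\epsilon}) \le \bar\alpha (1 + \delta) \lambda(\Theta) \\
 &\ \Leftrightarrow\ \lambda(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}) \le \tilde\alpha \lambda(\Theta) .
\end{aligned}$$

Hence we can conclude that

$$\pi_\delta(\{\theta' \in \Theta \mid \rho U_\delta(\theta') > U_\delta(\theta)\}) \le \bar\alpha \ \Rightarrow\ \lambda(\{\theta' \in \Theta \mid U(\theta') > U(\theta) + \tilde\epsilon\}) \le \tilde\alpha \lambda(\Theta)$$

and the second part of the proof is complete.

We have shown that given $\bar\alpha \in (0, 1]$, $\rho \in (0, 1]$, $\tilde\epsilon := (\rho^{-1} - 1)(1 + \delta)$ and $\tilde\alpha := \frac{1+\delta}{\tilde\epsilon + \delta}\,\bar\alpha$, then

$$\pi(\Theta(\tilde\epsilon, \tilde\alpha); J, \delta) \ge \frac{1}{1 + \rho^J \left[ \dfrac{1}{\bar\alpha} - 1 \right] \dfrac{1 + \delta}{\delta}}
 = \frac{1}{1 + \left[ \dfrac{1 + \delta}{\tilde\epsilon + 1 + \delta} \right]^J \left[ \dfrac{1}{\tilde\alpha}\, \dfrac{1 + \delta}{\tilde\epsilon + \delta} - 1 \right] \dfrac{1 + \delta}{\delta}} .$$

Notice that $\tilde\epsilon \in [0, 1]$ and $\tilde\alpha \in (0, 1]$ are linked through a bijective relation to $\rho \in \left[\frac{1+\delta}{2+\delta}, 1\right]$ and
$\bar\alpha \in \left(0, \frac{\tilde\epsilon+\delta}{1+\delta}\right]$. Hence, the statement of the theorem is eventually obtained by setting the desired
$\epsilon = \tilde\epsilon$ and $\alpha = \tilde\alpha$ in the above inequality. ■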

To prove Corollary 3 we will need the following fact.
Proposition 11: For all $x \ge 0$, $y \ge 1$,

$$\log \frac{x + y}{y} \ge \frac{x}{x + y} .$$

Proof: Fix an arbitrary $y \ge 1$. If $x = 0$ then $\log \frac{x+y}{y} = \frac{x}{x+y} = 0$. Moreover, for all $x \ge 0$,

$$\frac{d}{dx} \frac{x}{x + y} = \frac{y}{(x + y)^2} \le \frac{1}{x + y} = \frac{d}{dx} \log \frac{x + y}{y} ,$$

hence the inequality holds for all $x \ge 0$. ■

Proof of Corollary 3: To make sure that $\pi(\Theta(\epsilon, \alpha); J, \delta) \ge \sigma$ we need to select $J$ such that

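$$\frac{1}{1 + \left[ \dfrac{1 + \delta}{\epsilon + 1 + \delta} \right]^J \left[ \dfrac{1}{\alpha}\, \dfrac{1 + \delta}{\epsilon + \delta} - 1 \right] \dfrac{1 + \delta}{\delta}} \ge \sigma ,$$

or, in other words,

$$\left[ \frac{\epsilon + 1 + \delta}{1 + \delta} \right]^J \ge \frac{\sigma}{1 - \sigma} \left[ \frac{1}{\alpha}\, \frac{1 + \delta}{\epsilon + \delta} - 1 \right] \frac{1 + \delta}{\delta} .$$

It therefore suffices to choose $J$ such that

$$\left[ \frac{\epsilon + 1 + \delta}{1 + \delta} \right]^J \ge \frac{\sigma}{1 - \sigma}\, \frac{1}{\alpha} \left[ \frac{1 + \delta}{\delta} \right]^2 .$$

Taking logarithms

$$J \log \frac{\epsilon + 1 + \delta}{1 + \delta} \ge \log \frac{\sigma}{1 - \sigma} + \log \frac{1}{\alpha} + 2 \log \frac{1 + \delta}{\delta} .$$

Using the result of Proposition 11 with $x = \epsilon$ and $y = 1 + \delta$ one eventually obtains that it
suffices to select $J$ according to the inequality in the statement of Corollary 3. ■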
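The chain of inequalities in this proof can be checked numerically. The sketch below assumes that the bound of Theorem 2 reads $\pi(\Theta(\epsilon,\alpha);J,\delta) \ge 1 / \left(1 + \left[\frac{1+\delta}{\epsilon+1+\delta}\right]^J \left[\frac{1}{\alpha}\frac{1+\delta}{\epsilon+\delta} - 1\right]\frac{1+\delta}{\delta}\right)$ and that the resulting sufficient condition is $J \ge \frac{1+\epsilon+\delta}{\epsilon}\left[\log\frac{\sigma}{1-\sigma} + \log\frac{1}{\alpha} + 2\log\frac{1+\delta}{\delta}\right]$, which is what the steps above yield via Proposition 11. Function names and parameter values are ours.

```python
import math


def stationary_lower_bound(J, eps, alpha, delta):
    """Assumed lower bound on pi(Theta(eps, alpha); J, delta)."""
    rho = (1.0 + delta) / (eps + 1.0 + delta)
    factor = ((1.0 / alpha) * (1.0 + delta) / (eps + delta) - 1.0) \
        * (1.0 + delta) / delta
    return 1.0 / (1.0 + rho ** J * factor)


def sufficient_exponent(eps, alpha, sigma, delta):
    """Smallest integer J satisfying the sufficient condition derived above."""
    bracket = (math.log(sigma / (1.0 - sigma))
               + math.log(1.0 / alpha)
               + 2.0 * math.log((1.0 + delta) / delta))
    return math.ceil((1.0 + eps + delta) / eps * bracket)


eps, alpha, sigma, delta = 0.1, 0.01, 0.99, 0.1
J = sufficient_exponent(eps, alpha, sigma, delta)
guaranteed = stationary_lower_bound(J, eps, alpha, delta)
```

Because two of the steps replace an expression by a simpler upper bound, the guaranteed probability at the computed $J$ typically exceeds $\sigma$ with some margin.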

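Proof of Proposition 4: Notice that

$$\frac{d}{d\delta} f(\delta) = \frac{1}{\epsilon} \left[ \log \frac{\sigma}{1 - \sigma} + \log \frac{1}{\alpha} + 2 \log \frac{1 + \delta}{\delta} - 2\, \frac{1 + \epsilon + \delta}{\delta (1 + \delta)} \right] ,$$

$$\frac{d^2}{d\delta^2} f(\delta) = \frac{2\,(1 + \epsilon + \delta + 2 \epsilon \delta)}{\epsilon\, \delta^2 (1 + \delta)^2} > 0$$

and therefore the function $f(\delta)$ is convex in $\delta$. The second equation ensures that if $f(\delta)$ attains
a minimum then it is unique. To complete the proof we need to show that the equation

$$\frac{d}{d\delta} f(\delta) = \frac{1}{\epsilon} \left[ \log \frac{\sigma}{1 - \sigma} + \log \frac{1}{\alpha} + 2 \log \frac{1 + \delta}{\delta} - 2\, \frac{1 + \epsilon + \delta}{\delta (1 + \delta)} \right] = 0 \tag{22}$$

always has a solution for $\delta > 0$. To simplify the notation define

$$f_1(\delta) = \log \frac{\sigma}{1 - \sigma} + \log \frac{1}{\alpha} + 2 \log \frac{1 + \delta}{\delta} , \qquad f_2(\delta) = 2\, \frac{1 + \epsilon + \delta}{\delta (1 + \delta)} .$$

Then (22) simplifies to $f_1(\delta) = f_2(\delta)$. It is easy to see that both $f_1$ and $f_2$ are monotone
decreasing functions of $\delta$ and

$$\lim_{\delta \to 0^+} f_1(\delta) = \lim_{\delta \to 0^+} f_2(\delta) = \infty ,$$

$$\lim_{\delta \to \infty} f_1(\delta) = \log \frac{\sigma}{1 - \sigma} + \log \frac{1}{\alpha} > 0 \ \text{ for } \sigma \in (0.5, 1), \quad \text{and} \quad \lim_{\delta \to \infty} f_2(\delta) = 0 .$$

Moreover, as $\delta$ tends to 0, $f_1(\delta)$ tends to infinity more slowly than $f_2(\delta)$ ($O(\log(1/\delta))$ instead of
$O(1/\delta)$). Therefore the two functions have to cross for some $\delta > 0$. ■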
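Since the argument shows that $f_1 - f_2$ changes sign exactly once on $(0, \infty)$, the minimizing $\delta$ can be located by bisection. The sketch below assumes that $f(\delta) = \frac{1+\epsilon+\delta}{\epsilon}\left[\log\frac{\sigma}{1-\sigma} + \log\frac{1}{\alpha} + 2\log\frac{1+\delta}{\delta}\right]$, $f_1(\delta) = \log\frac{\sigma}{1-\sigma} + \log\frac{1}{\alpha} + 2\log\frac{1+\delta}{\delta}$ and $f_2(\delta) = 2\frac{1+\epsilon+\delta}{\delta(1+\delta)}$, which is consistent with the derivatives used in the proof; the bracketing interval and function names are our choices.

```python
import math


def f1(delta, alpha, sigma):
    return (math.log(sigma / (1.0 - sigma)) + math.log(1.0 / alpha)
            + 2.0 * math.log((1.0 + delta) / delta))


def f2(delta, eps):
    return 2.0 * (1.0 + eps + delta) / (delta * (1.0 + delta))


def f(delta, eps, alpha, sigma):
    """Assumed bound on the exponent J as a function of delta."""
    return (1.0 + eps + delta) / eps * f1(delta, alpha, sigma)


def optimal_delta(eps, alpha, sigma, lo=1e-9, hi=1e6, iterations=200):
    """Solve f1(delta) = f2(delta) by bisection on a logarithmic scale.

    f1 - f2 is negative near 0 and positive for large delta (sigma > 0.5).
    """
    for _ in range(iterations):
        mid = math.sqrt(lo * hi)
        if f1(mid, alpha, sigma) - f2(mid, eps) < 0.0:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)


eps, alpha, sigma = 0.1, 0.01, 0.99
delta_star = optimal_delta(eps, alpha, sigma)
```

For these illustrative values the root lies between 0.1 and 0.2, so a choice such as $\delta = 0.1$ is of the same order as the minimizer.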

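Proof of Proposition 8: Let $p(\cdot\,; J, \delta)$ denote the density of $\pi(\cdot\,; J, \delta)$. Consider any $\theta^* \in \Theta^*$. We
have:

$$\begin{aligned}
p(\theta; J, \delta) &= \frac{[U(\theta) + \delta]^J}{\int_{\theta' \in \Theta} [U(\theta') + \delta]^J \lambda(d\theta')}
 \le \frac{[U(\theta^*) + \delta]^J}{\int_{\theta' \in \Theta} [U(\theta') + \delta]^J \lambda(d\theta')} \\
&\le \frac{[U(\theta^*) + \delta]^J}{\int_{\theta' \in \Theta^*} [U(\theta') + \delta]^J \lambda(d\theta')}
 = \frac{\lambda(\Theta)}{\lambda(\Theta^*)}\, \frac{1}{\lambda(\Theta)}
 \le \frac{1}{\beta}\, \frac{1}{\lambda(\Theta)} .
\end{aligned}$$

Recall that the independent uniform proposal distribution over $\Theta$ has density $q(\theta) = \frac{1}{\lambda(\Theta)}$.
Hence, from the above inequality we obtain that $M = \frac{1}{\beta}$ satisfies the inequality in the statement
of Theorem 7. Therefore, we can write $\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV} \le (1 - \beta)^k$. Hence, $(1 - \beta)^k \le
\gamma_p \ \Rightarrow\ P_{\theta_k}(\Theta(\epsilon, \alpha); J, \delta) \ge p$, from which (12) is eventually obtained. ■

To prove Proposition 9 we first establish a general fact.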

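Proposition 12: Let the notation and assumptions of Theorem 2 hold. Let $p(\cdot\,; J, \delta)$ denote the
density of $\pi(\cdot\,; J, \delta)$. For all $J \ge \tilde J \ge 0$ and $\delta > 0$,

$$p(\theta; J, \delta) \le \left[ \frac{1 + \delta}{\delta} \right]^{J - \tilde J} p(\theta; \tilde J, \delta), \quad \forall \theta \in \Theta .$$

Proof:

$$\begin{aligned}
p(\theta; J, \delta) &= \frac{[U(\theta) + \delta]^J}{\int_{\theta' \in \Theta} [U(\theta') + \delta]^J \lambda(d\theta')} \\
&= [U(\theta) + \delta]^{J - \tilde J}\ \frac{\int_{\theta' \in \Theta} [U(\theta') + \delta]^{\tilde J} \lambda(d\theta')}{\int_{\theta' \in \Theta} [U(\theta') + \delta]^{J} \lambda(d\theta')}\ p(\theta; \tilde J, \delta) \\
&\le (1 + \delta)^{J - \tilde J}\ \frac{\int_{\theta' \in \Theta} [U(\theta') + \delta]^{\tilde J} \lambda(d\theta')}{\delta^{J - \tilde J} \int_{\theta' \in \Theta} [U(\theta') + \delta]^{\tilde J} \lambda(d\theta')}\ p(\theta; \tilde J, \delta)
 = \left[ \frac{1 + \delta}{\delta} \right]^{J - \tilde J} p(\theta; \tilde J, \delta) .
\end{aligned}$$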

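Proof of Proposition 9: If we set $\tilde J = 0$ in Proposition 12 we obtain that

$$M = \left[ \frac{1 + \delta}{\delta} \right]^J$$

satisfies the inequality in the statement of Theorem 7. Hence, we obtain

$$\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV} \le \left( 1 - \left[ \frac{\delta}{1 + \delta} \right]^J \right)^k .$$

Hence, it suffices to have

$$\left( 1 - \left[ \frac{\delta}{1 + \delta} \right]^J \right)^k \le \gamma_p$$

in order to guarantee $\|P_{\theta_k} - \pi(\cdot\,; J, \delta)\|_{TV} \le \gamma_p$. Taking logarithms this becomes

$$k \log \frac{(1 + \delta)^J}{(1 + \delta)^J - \delta^J} \ge \log \frac{1}{\gamma_p}$$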

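and, by applying Proposition 11 with $x = \delta^J$ and $y = (1 + \delta)^J - \delta^J$, we eventually obtain that it
suffices to have

$$k \ge \left[ \frac{1 + \delta}{\delta} \right]^J \log \frac{1}{\gamma_p} .$$

Eventually, one obtains (13) by changing the base of the logarithms in the right-hand side of
the bound of Corollary 3 from $e$ to $\frac{1+\delta}{\delta}$ and by substituting $J$ with the so-obtained expression
in the right-hand side of the above inequality. ■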
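The last step can be illustrated numerically. The sketch below assumes, following the argument above, that the total variation distance after $k$ steps is bounded by $\left(1 - \left[\frac{\delta}{1+\delta}\right]^J\right)^k$ and that $k \ge \left[\frac{1+\delta}{\delta}\right]^J \log\frac{1}{\gamma}$ iterations suffice to bring this bound below a tolerance $\gamma$. The names `tv_bound` and `sufficient_iterations` and the numerical values are ours.

```python
import math


def tv_bound(k, J, delta):
    """Assumed bound (1 - 1/M)**k with M = ((1 + delta) / delta)**J."""
    M = ((1.0 + delta) / delta) ** J
    return math.exp(k * math.log1p(-1.0 / M))


def sufficient_iterations(J, delta, gamma):
    """Number of iterations k >= M * log(1 / gamma)."""
    M = ((1.0 + delta) / delta) ** J
    return math.ceil(M * math.log(1.0 / gamma))


J, delta, gamma = 5, 0.1, 0.01
k = sufficient_iterations(J, delta, gamma)
bound_at_k = tv_bound(k, J, delta)
```

The factor $\left[\frac{1+\delta}{\delta}\right]^J$ grows geometrically with $J$, which is the reason a constant schedule at the final exponent yields only a conservative iteration count.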
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